Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic metric for Containers on Windows #1

Closed

Conversation

KlwntSingh
Copy link
Owner

Description:

Add pod level metric collection for Windows

1. Changed receiver for awscontainerinsights to run for Windows with right
 metric collection sources.
2. Added summary API in kubeletclient
3. Add kubeletProvider to return metrics at different levels i.e. pod, contianer, node.
4. Updated hostInfo providers to run for Windows.

Link to tracking Issue:

Testing:
Tested locally and it worked

Documentation:

@KlwntSingh KlwntSingh changed the base branch from main to aws-cwa-dev December 6, 2023 17:34
if err := nc.osSetenv(goPSUtilProcDirEnv, hostProc); err != nil {
return nil, fmt.Errorf("NodeCapacity cannot set goPSUtilProcDirEnv to %s: %w", hostProc, err)
// On Windows, hostProc is empty
if hostProc != "" {
Copy link

@tzifudzi tzifudzi Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: While the current approach effectively handles the condition, a way to potentially improve readability could be to introduce a variable like isLinux.

For example:

var isLinux = hostProc != ""
if isLinux {
     // Existing logic...
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to have _win/_lin specific files for some of these implementation ones. gopsutil does this
https://github.com/shirou/gopsutil/blob/master/process/process_windows.go

Another simple way but be to have the method call windows or linux sub-implementations within the same file. Having them split out is nice because then you are not required to have the win/lin impls cross-platform compile

package host

// These variables are invalid for Windows
const (

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another nit: This works, but could there be other more readable (and maintainable) ways to refactor this and enable the ability of having OS-specific behaviour or properties? Not sure how the codebase is what might work best so would rely on your good judgement.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw similar pattern of defining OS specific const in repo.

@@ -0,0 +1,106 @@
package k8swindows

Copy link

@tzifudzi tzifudzi Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we ignore building this on Linux?

//go:build windows
// +build windows

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, addressed!

return result
}
metrics = k.decorateMetrics(metrics)
for _, cadvisorMetric := range metrics {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can the metric name in iteration be something like k8s_summary_metric?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

tags[ci.AutoScalingGroupNameKey] = c.hostInfo.GetAutoScalingGroupName()

// add tags for EKS
tags[ci.ClusterNameKey] = c.hostInfo.GetClusterName()
Copy link

@tzifudzi tzifudzi Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Seems like you covered the critical tags. Additional tags to consider if not already added. Might edit and add more as I think of more ideas

  • platform e.g. Windows or Linux
  • availability zone and region

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

information about region is already part metrics internally, information about OS will be included in future PRs when refining metric labels

hostIP := os.Getenv("HOST_IP")
kclient, err := kubeletutil.NewKubeletClient(hostIP, ci.KubeSecurePort, logger)
if err != nil {
return nil, errors.New("failed to initialize kubelet client")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To improve error handling and make debugging easier, can we please include the original error details so that a user will know what happened and how to rectify it?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

}

func new(logger *zap.Logger, info host.Info) (*kubeletSummaryProvider, error) {
hostIP := os.Getenv("HOST_IP")
Copy link

@tzifudzi tzifudzi Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be helpful to check if HOST_IP was set correctly or is this checked later in the code?

Maybe we can add

if hostIP == "" {
  // return error or print warning if value is optional
}

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before adding this validator, I want to verify a usecase of what happens when HOST_IP is not set. Based on results, i can add this additional check. if it fails, then i will add validation but it works then this is not required.

@tzifudzi
Copy link

Does the project contain any unit/integration tests? Will these be added later?


nodeCPUCores := k.hostInfo.GetNumCores()
for _, pod := range summary.Pods {
k.logger.Info(fmt.Sprintf("pod summary %v", pod.PodRef.Name))
Copy link

@tzifudzi tzifudzi Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Is there any way we can also show how many containers are running in the pod? If yes, and you agree it might be useful, lets add it. If not then nevermind.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information will be collected in separate metrics field but not exactly the number of containers rather cpu, memory metrics for each container in pod.

@@ -197,7 +209,9 @@ func (acir *awsContainerInsightReceiver) Shutdown(context.Context) error {
// collectData collects container stats from Amazon ECS Task Metadata Endpoint
func (acir *awsContainerInsightReceiver) collectData(ctx context.Context) error {
var mds []pmetric.Metrics
if acir.cadvisor == nil && acir.k8sapiserver == nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Similar to earlier feedback, maybe something like this might be more readable?

isLinux := acir.k8swindows != nil
isWindows := acir.cadvisor == nil && acir.k8sapiserver == nil && acir.k8swindows == nil

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will address this in future PR since this complex logic might not be required.

@KlwntSingh
Copy link
Owner Author

Does the project contain any unit/integration tests? Will these be added later?

Yes, unit tests will be part of this repo but integ are part of different.
This PR is just to put in code structure. I will start adding unit tests from subsequent PRs

@chadpatel
Copy link

you should branch off of aws-cwa-dev in the contributing repo and merge this to a new branch in that repo instead of merging it into your fork. This will be helpful for having multiple developers and you can take advantage of the workflows and Amazon's github license or whatever for running those workflows

@chadpatel
Copy link

General comment, your implementation for windows CI is really an alternate implementation to the cAdvisor code. cAdvisor as a name is probably over-used, we could rename to kubelet or kubeletStats or kubeletProvider or nodeStats or whatever but I would avoid creating a whole new "windows" thing when you essentially just need to create an alternate implementation of the existing cAdvisor "provider"

This will reduce the code sprawl and minimize the number of windows/linux checks you have to make

@@ -33,7 +33,7 @@ func (c *CPUMetricExtractor) GetValue(info *cInfo.ContainerInfo, mInfo CPUMemInf

// When there is more than one stats point, always use the last one
curStats := GetStats(info)
metric := newCadvisorMetric(containerType, c.logger)
metric := NewCadvisorMetric(containerType, c.logger)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intereseting! I wonder if we need to make the concept of a "cAdvisor metric" more generic, basically it is a container insights metric or container insights system metric. Then that could sit in internal and be shared by both the cAdvisor and windows impl.

For now though I think it is fine to just do this, that is a decent refactor for not a ton of value other than better semantics

@@ -44,7 +44,8 @@ type CAdvisorMetric struct {
logger *zap.Logger
}

func newCadvisorMetric(mType string, logger *zap.Logger) *CAdvisorMetric {
// NewCadvisorMetric is public because used in k8swindows package, outside cadvisor package.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - I would delete this comment, this is useful in a diff but I don't think it is useful for this comment to live on forever. Packages/methods don't need to detail why they are public or private or whatever

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed, removed the comment!

}
if err := nc.osSetenv(goPSUtilProcDirEnv, hostProc); err != nil {
return nil, fmt.Errorf("NodeCapacity cannot set goPSUtilProcDirEnv to %s: %w", hostProc, err)
// On Windows, hostProc is empty

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - I don't love this comment. The check is for hostProc, not windows/linux. The code should be evident that it needs a hostProc, all the reasons why it may not have a valid hostproc (i.e. windows) is not relevant

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

if err := nc.osSetenv(goPSUtilProcDirEnv, hostProc); err != nil {
return nil, fmt.Errorf("NodeCapacity cannot set goPSUtilProcDirEnv to %s: %w", hostProc, err)
// On Windows, hostProc is empty
if hostProc != "" {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to have _win/_lin specific files for some of these implementation ones. gopsutil does this
https://github.com/shirou/gopsutil/blob/master/process/process_windows.go

Another simple way but be to have the method call windows or linux sub-implementations within the same file. Having them split out is nice because then you are not required to have the win/lin impls cross-platform compile

@@ -9,6 +9,7 @@ import (

"go.uber.org/zap"
corev1 "k8s.io/api/core/v1"
stats "k8s.io/kubelet/pkg/apis/stats/v1alpha1"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any idea on the govuln check?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it something about being alpha?

if err != nil {
return err
}
// On Windows, K8sWindows provider is used which internally uses kubelet summary API for metrics.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is an unnecessary comment

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

@@ -197,7 +209,9 @@ func (acir *awsContainerInsightReceiver) Shutdown(context.Context) error {
// collectData collects container stats from Amazon ECS Task Metadata Endpoint
func (acir *awsContainerInsightReceiver) collectData(ctx context.Context) error {
var mds []pmetric.Metrics
if acir.cadvisor == nil && acir.k8sapiserver == nil {
// On linux, acir.k8swindows will be nil.
// On Windows, both acir.cadvisor and acir.k8sapiserver will be nil.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a useful comment, it is basically documenting your assumption about the entry criteria

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed.

@@ -39,6 +41,7 @@ type awsContainerInsightReceiver struct {
cancel context.CancelFunc
cadvisor metricsProvider
k8sapiserver metricsProvider
k8swindows metricsProvider

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my two cents - k8swindows and cadvisor are the same, no reason to not just populate cadvisor with two different impls, one impl for windows and one impl for linux (actually using cadvisor).

Then you don't need special code in the receiver for k8swindows.GetMetrics()

I am open to renaming it, kubeletProvider or something

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.
I have removed k8swindows from this struct. Renamed cadvisor field to containerMetricProvider and reused it for both cadvisor and k8sWindows implementation.

KlwntSingh pushed a commit that referenced this pull request Dec 15, 2023
…emetry#24676)

**Description:** The metadata.yml for the SSH check receiver currently
documents a resource attribute containing the SSH endpoint but this is
not emitted. This PR updates the receiver to include this resource
attribute.

**Link to tracking Issue:** open-telemetry#24441 

**Testing:**

Example collector config:
```yaml
receivers:
  sshcheck:
    endpoint: 13.245.150.131:22
    username: ec2-user
    key_file: /Users/dewald.dejager/.ssh/sandbox.pem
    collection_interval: 15s
    known_hosts: /Users/dewald.dejager/.ssh/known_hosts
    ignore_host_key: false
    resource_attributes:
      "ssh.endpoint":
        enabled: true

exporters:
  logging:
    verbosity: detailed
  prometheus:
    endpoint: 0.0.0.0:8081
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [sshcheck]
      exporters: [logging, prometheus]
```

The log output looks like this:
```
2023-07-30T16:52:38.724+0200    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 2, "data points": 2}
2023-07-30T16:52:38.724+0200    info    ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> ssh.endpoint: Str(13.245.150.131:22)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/sshcheckreceiver 0.82.0-dev
Metric #0
Descriptor:
     -> Name: sshcheck.duration
     -> Description: Measures the duration of SSH connection.
     -> Unit: ms
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 2023-07-30 14:52:22.381672 +0000 UTC
Timestamp: 2023-07-30 14:52:38.404003 +0000 UTC
Value: 319
Metric #1
Descriptor:
     -> Name: sshcheck.status
     -> Description: 1 if the SSH client successfully connected, otherwise 0.
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
StartTimestamp: 2023-07-30 14:52:22.381672 +0000 UTC
Timestamp: 2023-07-30 14:52:38.404003 +0000 UTC
Value: 1
```

And the Prometheus metrics look like this:
```
# HELP sshcheck_duration Measures the duration of SSH connection.
# TYPE sshcheck_duration gauge
sshcheck_duration{ssh_endpoint="13.245.150.131:22"} 311
# HELP sshcheck_status 1 if the SSH client successfully connected, otherwise 0.
# TYPE sshcheck_status gauge
sshcheck_status{ssh_endpoint="13.245.150.131:22"} 1
```
KlwntSingh pushed a commit that referenced this pull request Dec 15, 2023
)

**Description:** 

Adding command line argument `--status-code` to `telemetrygen traces`,
which accepts `(Unset,Error,Ok)` (case sensitive) or the enum equivalent
of `(0,1,2)`.

Running 

```shell
telemetrygen traces --otlp-insecure --traces 1 --status-code 1
```

against a minimal local collector yields

```txt
2023-07-29T21:27:57.862+0100	info	ResourceSpans #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0
Resource attributes:
     -> service.name: Str(telemetrygen)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope telemetrygen
Span #0
    Trace ID       : f6dc4be32c78b9999c69d504a79e68c1
    Parent ID      : 4e2cd6e0e90cf2ea
    ID             : 20835413e32d26a5
    Name           : okey-dokey
    Kind           : Server
    Start time     : 2023-07-29 20:27:57.861602 +0000 UTC
    End time       : 2023-07-29 20:27:57.861726 +0000 UTC
    Status code    : Error
    Status message :
Attributes:
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-client)
Span #1
    Trace ID       : f6dc4be32c78b9999c69d504a79e68c1
    Parent ID      :
    ID             : 4e2cd6e0e90cf2ea
    Name           : lets-go
    Kind           : Client
    Start time     : 2023-07-29 20:27:57.861584 +0000 UTC
    End time       : 2023-07-29 20:27:57.861726 +0000 UTC
    Status code    : Error
    Status message :
Attributes:
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-server)
```

and similarly (the string version)

```shell
telemetrygen traces --otlp-insecure --traces 1 --status-code '"Ok"'
```

produces 

```txt
Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0
Resource attributes:
     -> service.name: Str(telemetrygen)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope telemetrygen
Span #0
    Trace ID       : dfd830da170acfe567b12f87685d7917
    Parent ID      : 8e15b390dc6a1ccc
    ID             : 165c300130532072
    Name           : okey-dokey
    Kind           : Server
    Start time     : 2023-07-29 20:29:16.026965 +0000 UTC
    End time       : 2023-07-29 20:29:16.027089 +0000 UTC
    Status code    : Ok
    Status message :
Attributes:
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-client)
Span #1
    Trace ID       : dfd830da170acfe567b12f87685d7917
    Parent ID      :
    ID             : 8e15b390dc6a1ccc
    Name           : lets-go
    Kind           : Client
    Start time     : 2023-07-29 20:29:16.026956 +0000 UTC
    End time       : 2023-07-29 20:29:16.027089 +0000 UTC
    Status code    : Ok
    Status message :
Attributes:
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-server)
```

The default is `Unset` which is the current behaviour.

**Link to tracking Issue:**

24286

**Testing:**

Added unit tests which covers both valid and invalid inputs.

**Documentation:**

Command line arguments are self documenting via the usage info in the
flag.

Co-authored-by: Pablo Baeyens <[email protected]>
This PR defines code structure for metric provider which works on Windows.

### Changelog
1. Changed receiver.go in awscontainerinsights to run for Windows with metric provider.
2. Added summary API in kubeletclient
3. Add kubeletProvider to return metrics at different levels i.e. pod, contianer, node.
4. Updated hostInfo providers to run for Windows.
5. Updated ebsVolume Info provider to run for Windows.

## Todos:
1. Define correct ebsVolume Info provider for Windows
2. Change logic around k8s leader election to run for Windows
@KlwntSingh KlwntSingh force-pushed the ci-windows-basic-pod-metrics-kubelet-1 branch from b7a7513 to d575b52 Compare December 16, 2023 19:00
@KlwntSingh
Copy link
Owner Author

General comment, your implementation for windows CI is really an alternate implementation to the cAdvisor code. cAdvisor as a name is probably over-used, we could rename to kubelet or kubeletStats or kubeletProvider or nodeStats or whatever but I would avoid creating a whole new "windows" thing when you essentially just need to create an alternate implementation of the existing cAdvisor "provider"

This will reduce the code sprawl and minimize the number of windows/linux checks you have to make

@KlwntSingh KlwntSingh closed this Dec 16, 2023
@KlwntSingh KlwntSingh reopened this Dec 16, 2023
@KlwntSingh KlwntSingh force-pushed the ci-windows-basic-pod-metrics-kubelet-1 branch from bd19de8 to 97d5c39 Compare December 17, 2023 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants